The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining

Authors

  • Huy Nguyen Anh Pham
  • Evangelos Triantaphyllou
Abstract

Department of Computer Science, 298 Coates Hall, Louisiana State University, Baton Rouge, LA 70803. Webpage: http://www.csc.lsu.edu/trianta

Many classification studies conclude with a summary table that presents the performance results of applying various data mining approaches to different datasets. No single method outperforms all other methods all the time. Furthermore, the performance of a classification method in terms of its false-positive and false-negative rates may be totally unpredictable. Attempts to minimize either of these two rates may lead to an increase in the other. If the model allows new data to be deemed unclassifiable when there is not adequate information to classify them, then it is possible for both error rates to be very low while, at the same time, the rate of unclassifiable new examples is very high. The root of this critical problem is the overfitting and overgeneralization behavior of a given classification approach when it processes a particular dataset. Although this situation is of fundamental importance to data mining, it has not been studied from a comprehensive point of view. Thus, this chapter analyzes the above issues in depth. It also proposes a new approach, called the Homogeneity-Based Algorithm (HBA), for optimally controlling the three error rates. This is done by first formulating an optimization problem. The key development in this chapter is a special way of analyzing the space of the training data and then partitioning it according to the data density of different regions of this space. Next, the classification task is pursued based on this partitioning of the training space. In this way, the three error rates can be controlled in a comprehensive manner. Some preliminary computational results seem to indicate that the proposed approach has significant potential to fill a critical gap in current data mining methodologies.

Keywords: data mining, classification, prediction, overfitting, overgeneralization, false-positive, false-negative, homogenous set, homogeneity degree, optimization.
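The trade-off among the three error rates described in the abstract can be made concrete with a small sketch. The abstract does not state the form of the optimization problem, so the weighted-sum cost below is only an assumed illustration, and the names `UNCLASSIFIED`, `error_rates`, `total_cost`, and the penalty parameters `c_fp`, `c_fn`, `c_uc` are hypothetical. The sketch is not the HBA itself; it only shows how a model that abstains often can have zero false positives and false negatives while leaving most examples unclassified.

```python
from collections import Counter

# Hypothetical marker: the model declined to classify the example.
UNCLASSIFIED = None


def error_rates(y_true, y_pred):
    """Return (false-positive, false-negative, unclassifiable) rates.

    Labels are 1 (positive) and 0 (negative). Each rate is expressed as a
    fraction of all examples, which is an assumed convention.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred is UNCLASSIFIED:
            counts["uc"] += 1
        elif pred == 1 and truth == 0:
            counts["fp"] += 1
        elif pred == 0 and truth == 1:
            counts["fn"] += 1
    n = len(y_true)
    return counts["fp"] / n, counts["fn"] / n, counts["uc"] / n


def total_cost(rates, c_fp=1.0, c_fn=1.0, c_uc=1.0):
    """Assumed objective: a penalty-weighted sum of the three rates."""
    fp, fn, uc = rates
    return c_fp * fp + c_fn * fn + c_uc * uc


y_true = [1, 1, 0, 0, 1, 0, 1, 0]
# A cautious model: never wrong, but abstains on most examples.
cautious = [1, None, 0, None, None, 0, None, None]
# An eager model: always answers, at the price of some errors.
eager = [1, 1, 1, 0, 0, 0, 1, 1]

print(error_rates(y_true, cautious))  # (0.0, 0.0, 0.625)
print(error_rates(y_true, eager))     # (0.25, 0.125, 0.0)
```

Which model is preferable depends entirely on the assumed penalties: with equal weights the eager model costs less (0.375 versus 0.625), whereas a large `c_fn`, as might apply in medical diagnosis, can reverse the ranking.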

Similar resources

Prediction of Diabetes by Employing a New Data Mining Approach Which Balances Fitting and Generalization

According to the American Diabetes Association [3] in November 2007, 20.8 million children and adults in the United States (approximately 7% of the population) were diagnosed with diabetes. Thus, the ability to diagnose diabetes early plays an important role for the patient’s treatment process. The World Health Organization [4] proposed the eight attributes, depicted in Table 1, of physiologica...

Impact of Patients’ Gender on Parkinson’s disease using Classification Algorithms

In this paper, the accuracy of two machine learning algorithms, SVM and Bayesian Network, is investigated, as they are two important algorithms in the diagnosis of Parkinson’s disease. We use Parkinson's disease data from the University of California, Irvine (UCI). In order to optimize the SVM algorithm, different kernel functions and C parameters have been used and our results show that SVM with C par...

On Mining Fuzzy Classification Rules for Imbalanced Data

A fuzzy rule-based classification system (FRBCS) is a popular machine learning technique for classification purposes. One of the major issues when applying it to imbalanced data sets is its bias toward the majority class, such that it performs poorly with respect to the minority class. However, in many cases the minority classes are more important than the majority ones. In this paper, we have extended ...

Data sanitization in association rule mining based on impact factor

Data sanitization is a process used to promote the sharing of transactional databases among organizations and businesses; it alleviates the concerns of individuals and organizations regarding the disclosure of sensitive patterns. It transforms the source database into a released database so that counterparts cannot discover the sensitive patterns and so data confidentiality is preserved ag...

Publication date: 2008